Abstract

Many modern data analysis problems involve inferences from streaming data. However,
streaming data is not easily amenable to the standard probabilistic modeling
approaches, which assume that we condition on finite data. We develop population 
variational Bayes, a new approach for using Bayesian modeling to analyze streams 
of data. It approximates a new type of distribution, the population posterior, which 
combines the notion of a population distribution of the data with Bayesian inference 
in a probabilistic model. We study our method with latent Dirichlet allocation and 
Dirichlet process mixtures on several large-scale data sets. 


1 Introduction 

Probabilistic modeling has emerged as a powerful tool for data analysis. It is an intuitive language 
for describing assumptions about data and provides efficient algorithms for analyzing real data under 
those assumptions. The main idea comes from Bayesian statistics. We encode our assumptions about 
the data in a structured probability model of hidden and observed variables; we condition on a data 
set to reveal the posterior distribution of the hidden variables; and we use the resulting posterior as 
needed, for example to form predictions through the posterior predictive distribution or to explore the 
data through the posterior expectations of the hidden variables. 

Many modern data analysis problems involve inferences from streaming data. Examples include 
exploring the content of massive social media streams (e.g., Twitter, Facebook), analyzing live video 
streams, modeling the preferences of users on an online platform for recommending new items, and 
predicting human mobility patterns for anticipatory computing. Such problems, however, cannot 
easily take advantage of the standard approach to probabilistic modeling, which always assumes 
that we condition on a finite data set. This might be surprising to some readers; after all, one of the 
tenets of the Bayesian paradigm is that we can update our posterior when given new information. 
(“Yesterday’s posterior is today’s prior.”) But there are two problems with using Bayesian updating
on data streams. 

The first problem is that Bayesian posteriors will become overconfident. Conditional on never-ending
data, most posterior distributions (with a few exceptions) result in a point mass at a single
configuration of the latent variables. That is, the posterior contains no variance around its idea of
the hidden variables that generated the observed data. In theory this is sensible, but only in the
impossible scenario where the data truly came from the proposed model. In practice, all models
provide approximations to the data-generating distribution, and we always harbor uncertainty about
the model (and thus the hidden variables) even in the face of an infinite data stream.

The second problem is that the data stream might change over time. This is an issue because, 
frequently, our goal in applying probabilistic models to streams is not to characterize how they 
change, but rather to accommodate it. That is, we would like for our current estimate of the latent 
variables to be accurate to the current state of the stream and to adapt to how the stream might slowly 
change. (This is in contrast, for example, to time series modeling.) Traditional Bayesian updating 
cannot handle this. Either we explicitly model the time series, and pay a heavy inferential cost, or we 
tacitly assume that the data are independent and exchangeable. 

In this paper we develop new ideas for analyzing data streams with probabilistic models. Our 
approach combines the frequentist notion of the population distribution with probabilistic models and 
Bayesian inference. 

Main idea: The population posterior. Consider a latent variable model of $\alpha$ data points. (This
is unconventional notation; we will describe why we use it below.) Following [13], we define the
model to have two kinds of hidden variables: global hidden variables $\beta$ contain latent structure that
potentially governs any data point; local hidden variables $z_i$ contain latent structure that only governs
the $i$th data point. Such models are defined by the joint,

\[ p(\beta, z, x) = p(\beta) \prod_{i=1}^{\alpha} p(x_i, z_i \mid \beta), \tag{1} \]

where $x = x_{1:\alpha}$ and $z = z_{1:\alpha}$. Traditional Bayesian statistics conditions on a fixed data set $x$ to obtain
the posterior distribution of the hidden variables $p(\beta, z \mid x)$. As we discussed, this framework cannot
accommodate data streams. We need a different way to use the model.

We define a new distribution, the population posterior, which enables us to consider Bayesian
modeling of streams. Suppose we observe $\alpha$ data points independently from the underlying population
distribution, $X \sim F_\alpha$. This induces a posterior $p(\beta, z \mid X)$, which is a function of the random data.
The expected value of this posterior distribution is the population posterior,

\[ \mathbb{E}_{F_\alpha}[p(\beta, z \mid X)] = \mathbb{E}_{F_\alpha}\left[ \frac{p(\beta, z, X)}{p(X)} \right]. \tag{2} \]

Notice that this distribution is not a function of observed data; it is a function of the population
distribution $F$ and the data set size $\alpha$. The data set size is a parameter that can be set. This parameter
controls the variance of the population posterior, and so depends on how close the model is to the
true data distribution.
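
To make the role of $\alpha$ concrete, the following is a minimal sketch under assumptions that are not in the paper: a Bernoulli likelihood with a Beta(1, 1) prior and a population $F$ that is Bernoulli(0.3). The function name `population_posterior_moments` is hypothetical. It averages the exact posteriors obtained from many simulated data sets of size $\alpha$, which is a Monte Carlo estimate of the population posterior, and reports the mean and variance of the resulting mixture.

```python
import numpy as np

def population_posterior_moments(alpha, true_p=0.3, n_datasets=20000, seed=0):
    """Monte Carlo mean and variance of the population posterior (Beta-Bernoulli toy model)."""
    rng = np.random.default_rng(seed)
    # Each simulated data set of alpha points is summarised by its number of ones.
    s = rng.binomial(alpha, true_p, size=n_datasets)
    a, b = 1.0 + s, 1.0 + alpha - s            # Beta(1, 1) prior -> Beta(a, b) posterior
    post_mean = a / (a + b)
    post_var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    # The population posterior is the mixture of these posteriors, so its
    # variance follows from the law of total variance.
    mean = post_mean.mean()
    var = post_var.mean() + post_mean.var()
    return mean, var

mean_small, var_small = population_posterior_moments(alpha=10)
mean_large, var_large = population_posterior_moments(alpha=1000)
```

In this toy setting a larger $\alpha$ gives a narrower population posterior, while any finite $\alpha$ keeps the variance strictly positive no matter how much of the stream is consumed.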

We have defined a new problem. Given an endless stream of data points coming from $F$ and a value
for $\alpha$, our goal is to approximate the corresponding population posterior. We will use variational
inference and stochastic optimization to approximate the population posterior. As we will show, our
algorithm justifies applying a variant of stochastic variational inference [13] to a data stream. We
used this technique to analyze several data streams with modern probabilistic models, such as latent
Dirichlet allocation and Dirichlet process mixtures. With held-out likelihood as a measure of model
fitness, we found our method to give better models of the data than approaches based on full Bayesian
inference [13] or Bayesian updating [8].

Related work. Several methods exist for performing inference on streams of data. Refs. EH 
[28] propose extending Markov chain Monte Carlo methods for streaming data. However, sampling-based
approaches do not scale to massive datasets. The variational approximation enables more
scalable inference. Ref. IfTO propose online variational inference by exponentially forgetting the 
variational parameters associated with old data. Ref. ca also decay parameters derived from old data, 
but interpret this action in the context of stochastic optimization, bringing guarantees of convergence 
to a local optimum. This gradient-based approach has enabled the application of more advanced 
probabilistic models to large-scale data sets. However, none of these methods are applicable to 
streaming data, because they implicitly rely on the data being of known size (even when based on 
subsampling data points to obtain noisy gradients). 

To apply the variational approximation to streaming data, Ref. [8] and Ref. IfTSll both propose
performing Bayesian updating to the approximating family. Their method uses the latest approximation
having seen n data points as a prior to be updated using the approximation of the next data point (or
mini-batch) using Bayes’ rule. Ref. Il2^ adapt this framework to nonparametric mixture modelling.
Here we take a different approach, by changing the overall variational objective to incorporate a 
population distribution F and following stochastic gradients of this new objective. 

Independently, Ref. Il2^ apply SVI to streaming settings by accumulating new data points into a 
growing window, then uniformly sampling from this window to update the variational parameters. 
Our method justifies that approach. Further, they propose updating parameters along a trust region, 
instead of following (natural) gradients, as a way of mitigating local optima. This innovation can be 
incorporated into our method. 


2 Variational Inference for the Population Posterior 


We develop population variational Bayes, a method for approximating the population posterior in
Eq. 2. Our method is based on variational inference and stochastic optimization.

The F-ELBO. The idea behind variational inference is to approximate difficult-to-compute distributions
through optimization unma. We introduce an approximating family of distributions over the
latent variables $q(\beta, z)$ and try to find the member of $q(\cdot)$ that minimizes the Kullback-Leibler (KL)
divergence to the target distribution.

Population variational Bayes (VB) uses variational inference to approximate the population posterior
in Eq. 2. It aims to solve the following problem,

\[ q^{*}(\beta, z) = \arg\min_{q} \mathrm{KL}\big( q(\beta, z) \,\|\, \mathbb{E}_{F_\alpha}[p(\beta, z \mid X)] \big). \tag{3} \]

As for the population posterior, this objective is a function of the population distribution of $\alpha$ data
points $F_\alpha$. Notice the difference to classical VB. In classical VB, we optimize the KL divergence
between $q(\cdot)$ and a posterior, $\mathrm{KL}(q(\beta, z) \,\|\, p(\beta, z \mid x))$. Its objective is a function of a fixed data set $x$;
the objective in Eq. 3 is a function of the population distribution $F_\alpha$.

We will use the mean-field variational family, where each latent variable is independent and governed
by a free parameter,

\[ q(\beta, z) = q(\beta \mid \lambda) \prod_{i=1}^{\alpha} q(z_i \mid \phi_i). \tag{4} \]

The free variational parameters are the global parameters $\lambda$ and local parameters $\phi_i$. Though we
focus on the mean-field family, extensions could consider structured families [14, 22], where there is
dependence between variables.

In classical VB, where we approximate the usual posterior, we cannot compute the KL. Thus, we 
optimize a proxy objective called the ELBO (evidence lower bound) that is equal to the negative KL 
up to an additive constant. Maximizing the ELBO is equivalent to minimizing the KL divergence to 
the posterior. 

In population VB we also optimize a proxy objective, the F-ELBO. The F-ELBO is an expectation of
the ELBO under the population distribution of the data,

\[ \mathcal{L}(\lambda, \phi; F_\alpha) = \mathbb{E}_{F_\alpha}\Big[ \mathbb{E}_{q}\Big[ \log p(\beta) - \log q(\beta \mid \lambda) + \sum_{i=1}^{\alpha} \big( \log p(X_i, Z_i \mid \beta) - \log q(Z_i) \big) \Big] \Big]. \tag{5} \]

The F-ELBO is a lower bound on the population evidence $\log \mathbb{E}_{F_\alpha}[p(X)]$ and a lower bound on the
negative KL to the population posterior. (See Appendix A.) The inner expectation is over the latent
variables $\beta$ and $Z$, and is a function of the variational distribution $q(\cdot)$. The outer expectation is over
the $\alpha$ random data points $X$, and is a function of the population distribution $F_\alpha(\cdot)$. The F-ELBO is
thus a function of both the variational distribution and the population distribution.

As we mentioned, classical VB maximizes the (classical) ELBO, which is equivalent to minimizing
the KL. The F-ELBO, in contrast, is only a bound on the negative KL to the population posterior.
Thus maximizing the F-ELBO is suggestive but is not guaranteed to minimize the KL. That said, our
studies show that this is a good quantity to optimize and in Appendix A we show that maximizing the
F-ELBO does minimize $\mathbb{E}_{F_\alpha}[\mathrm{KL}(q(z, \beta) \,\|\, p(z, \beta \mid X))]$. As we will see, the ability to sample from
$F_\alpha$ is the only additional requirement for maximizing the F-ELBO, using stochastic gradients.

Conditionally conjugate models. In the next section we will develop a stochastic optimization
algorithm to maximize Eq. 5. First, we describe the class of models that we will work with.

Following [13], we focus on conditionally conjugate models. A conditionally conjugate model is one
where each complete conditional—the conditional distribution of a latent variable given all the other 
latent variables and the observations—is in the exponential family. This class includes many models 
in modern machine learning, such as mixture models, topic models, many Bayesian nonparametric 
models, and some hierarchical regression models. Using conditionally conjugate models simplifies 
many calculations in variational inference. 

Under the joint in Eq. 1 we can write a conditionally conjugate model with two exponential families:

\begin{align*}
p(z_i, x_i \mid \beta) &= h(z_i, x_i) \exp\{ \beta^\top t(z_i, x_i) - a(\beta) \} \tag{6} \\
p(\beta \mid \zeta) &= h(\beta) \exp\{ \zeta^\top t(\beta) - a(\zeta) \}. \tag{7}
\end{align*}

We overload notation for base measures $h(\cdot)$, sufficient statistics $t(\cdot)$, and log normalizers $a(\cdot)$. Note
that $\zeta$ is the hyperparameter and that $t(\beta) = [\beta, -a(\beta)]$ [3].

In conditionally conjugate models each complete conditional is in an exponential family, and we
use these families as the factors in the variational distribution in Eq. 4. Thus $\lambda$ indexes the same
family as $p(\beta \mid z, x)$ and $\phi_i$ indexes the same family as $p(z_i \mid x_i, \beta)$. For example, in latent Dirichlet
allocation [5], the complete conditional of the topics is a Dirichlet; the complete conditional of
the per-document topic mixture is a Dirichlet; and the complete conditional of the per-word topic
assignment is a categorical. (See ina for details.)

Population variational Bayes. We have described the ingredients of our problem. We are given a
conditionally conjugate model, described in Eqs. 6 and 7; a parameterized variational family in Eq. 4;
and a stream of data from an unknown population distribution $F$. Our goal is to optimize the F-ELBO
in Eq. 5 with respect to the variational parameters.

The F-ELBO is a function of the population distribution, which is an unknown quantity. To overcome
this hurdle, we will use the stream of data from $F$ to form noisy gradients of the F-ELBO; we then
update the variational parameters with stochastic optimization.

Before describing the algorithm, however, we acknowledge one technical detail. Mirroring [13], we
optimize an F-ELBO that is only a function of the global variational parameters. The one-parameter
population VI objective is $\mathcal{L}(\lambda; F_\alpha) = \max_{\phi} \mathcal{L}(\lambda, \phi; F_\alpha)$. This implicitly optimizes the local parameter
as a function of the global parameter. The resulting objective is identical to Eq. 5 but with $\phi$ replaced
by $\phi(\lambda)$. (Details are in Appendix B.)


The next step is to form a noisy gradient of the F-ELBO so that we can use stochastic optimization
to maximize it. Stochastic optimization maximizes an objective by following noisy and unbiased
gradients GlEIl. We will write the gradient of the F-ELBO as an expectation with respect to $F_\alpha$, and
then use Monte Carlo estimates to form noisy gradients.

We compute the gradient of the F-ELBO by bringing the gradient operator inside the expectations of
Eq. 5.^1 This results in a population expectation of the classical VB gradient with $\alpha$ data points.


We take the natural gradient 0, which has a simple form in completely conjugate models na. 
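Specifically, the natural gradient of the F-ELBO is

\[ \hat{\nabla}_{\lambda} \mathcal{L}(\lambda; F_\alpha) = \mathbb{E}_{F_\alpha}\Big[ \zeta - \lambda + \sum_{i=1}^{\alpha} \mathbb{E}_{\phi_i(\lambda)}\big[ t(Z_i, X_i) \big] \Big]. \tag{8} \]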


We use this expression to compute noisy natural gradients at $\lambda$. We collect $\alpha$ data points from $F$; for
each we compute the optimal local parameters $\phi_i(\lambda)$, which is a function of the sampled data point
and variational parameters; we then compute the quantity inside the brackets in Eq. 8. The result
is a single-sample Monte-Carlo estimate of the gradient, which we can compute from a stream of
data. We follow the noisy gradient and repeat. This algorithm is summarized in Algorithm 1. Since
Eq. 8 is a Monte-Carlo estimate, we are free to draw $B$ data points from $F_\alpha$ (where $B \ll \alpha$)
and rescale the sufficient statistics by $\alpha / B$. This makes the gradient estimate faster to calculate, but
more noisy. As highlighted in Ref. [13], this makes sense because early iterations of the algorithm
have inaccurate values of $\lambda$ so it is wasteful to pass through lots of data before making updates to $\lambda$.

^1 For most models of interest, this is justified by the dominated convergence theorem.

Algorithm 1 Population Variational Bayes

Randomly initialize global variational parameter $\lambda^{(0)}$
Set iteration $t \leftarrow 0$
repeat
    Draw data minibatch $x_{1:B} \sim F_\alpha$
    Optimize local variational parameters $\phi_1, \ldots, \phi_B$
    Calculate natural gradient $\hat{\nabla}_{\lambda} \mathcal{L}(\lambda^{(t)}; F_\alpha)$ [see Eq. 8, with sufficient statistics rescaled by $\alpha / B$]
    Update global variational parameter with learning rate $\rho^{(t)}$
        $\lambda^{(t+1)} = \lambda^{(t)} + \rho^{(t)} \hat{\nabla}_{\lambda} \mathcal{L}(\lambda^{(t)}; F_\alpha)$
    Update iteration count $t \leftarrow t + 1$
until forever
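
The loop of Algorithm 1 can be sketched for a deliberately simple case. The example below is an illustration under assumptions that are not in the paper: a Bernoulli likelihood with a Beta(1, 1) prior and no local variables, so the "optimize local parameters" step is empty. The names `population_vb_beta_bernoulli` and `stream` are hypothetical, the global parameter is initialized at the prior rather than randomly, and the learning rate is held fixed.

```python
import numpy as np

def population_vb_beta_bernoulli(stream, alpha, batch_size=10, rho=0.01, n_iters=5000):
    """Population VB for a Bernoulli likelihood with a Beta(1, 1) prior (toy sketch).

    `stream(n)` must return n fresh 0/1 observations from the population.
    Returns the pseudo-counts (a, b) of the approximation q(beta) = Beta(a, b).
    """
    prior = np.array([1.0, 1.0])     # hyperparameter of the Beta prior
    lam = prior.copy()               # global variational parameter
    for _ in range(n_iters):
        x = stream(batch_size)                        # minibatch from the stream
        stats = np.array([x.sum(), (1 - x).sum()])    # sufficient statistics of the batch
        # Noisy natural gradient: prior plus rescaled statistics minus current parameter.
        nat_grad = prior + (alpha / batch_size) * stats - lam
        lam = lam + rho * nat_grad
    return lam

rng = np.random.default_rng(0)
stream = lambda n: rng.binomial(1, 0.3, size=n)       # stand-in for an endless stream
lam = population_vb_beta_bernoulli(stream, alpha=1000)
```

In this toy model the iterates hover around Beta(1 + 0.3α, 1 + 0.7α), so the spread of the approximation is set by $\alpha$ rather than by how long the stream has been running.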

Discussion. Thus far, we have defined the population posterior and showed how to approximate
it with population variational inference. Our derivation justifies using an algorithm like stochastic
variational inference (SVI) [13] on a stream of data. It is nearly identical to SVI, but includes an
additional parameter: the number of data points in the population posterior, $\alpha$.

Note we can recover the original SVI algorithm as an instance of population VI, thus reinterpreting it
as minimizing the KL divergence to the population posterior. We recover SVI by setting $\alpha$ equal to
the number of data points in the data set and replacing the stream of data $F$ with $\hat{F}_x$, the empirical
distribution of the observations. The “stream” in this case comes from sampling with replacement
from $\hat{F}_x$, which results in precisely the original SVI algorithm.^2
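
The reduction to SVI amounts to swapping the data source. As a small sketch (the helper name `empirical_stream` is hypothetical), a fixed data set can be turned into a "stream" by drawing minibatches uniformly with replacement, with $\alpha$ set to the data set size:

```python
import numpy as np

def empirical_stream(data, seed=0):
    """Return a sampler that draws minibatches with replacement from a fixed data set."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)

    def draw(batch_size):
        # Uniform indices with replacement, i.e. samples from the empirical distribution.
        idx = rng.integers(0, len(data), size=batch_size)
        return data[idx]

    return draw

data = np.array([0, 1, 1, 0, 1])
stream = empirical_stream(data)
alpha = len(data)        # SVI corresponds to alpha = number of observed data points
batch = stream(3)
```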

We focused on the conditionally conjugate family for convenience, i.e., the simple gradient in Eq. 8.
We emphasize, however, that by using recent tools for nonconjugate inference ifTSl l20l l26l . we 
can adapt the new ideas described above—the population posterior and the F-ELBO—outside of 
conditionally conjugate models. 

3 Empirical Evaluation 

We study the performance of population variational Bayes (population VB) against SVI and SVB [8].
With large real-world data we study two models, latent Dirichlet allocation [5] and Bayesian
nonparametric mixture models, comparing the held-out predictive performance of the algorithms. We study
the data coming in a true ordered stream, and in a permuted stream (to better match the assumptions
of SVI). Across data and models, population VB usually outperforms the existing approaches.

Models. We study two models. The first is latent Dirichlet allocation (LDA) [5]. LDA is a
mixed-membership model of text collections and is frequently used to find their latent topics. LDA
assumes that there are $K$ topics $\beta_k \sim \mathrm{Dir}(\eta)$, each of which is a multinomial distribution over a fixed
vocabulary. Documents are drawn by first choosing a distribution over topics $\theta_d \sim \mathrm{Dir}(\gamma)$ and then
drawing each word by choosing a topic assignment $z_{dn} \sim \mathrm{Mult}(\theta_d)$ and finally choosing a word from
the corresponding topic $w_{dn} \sim \beta_{z_{dn}}$. The joint distribution is

\[ p(\beta, \theta, z, w \mid \eta, \gamma) = p(\beta \mid \eta) \prod_{d=1}^{\alpha} p(\theta_d \mid \gamma) \prod_{i} p(z_{di} \mid \theta_d)\, p(w_{di} \mid \beta, z_{di}). \tag{9} \]

Fixing hyperparameters, the inference problem is to estimate the conditional distribution of the topics 
given a large collection of documents. 
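
The generative process behind Eq. 9 can be sketched as follows. This is only an illustration: the function name `sample_lda_corpus`, the fixed document length, and the default sizes are assumptions made for the example, not settings from the study.

```python
import numpy as np

def sample_lda_corpus(n_docs, n_words, n_topics, vocab_size, eta=0.01, gamma=0.1, seed=0):
    """Draw topics and documents from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Topics: one distribution over the vocabulary per topic, beta_k ~ Dir(eta).
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, gamma))      # theta_d ~ Dir(gamma)
        z = rng.choice(n_topics, size=n_words, p=theta)      # topic assignment per word
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return topics, docs

topics, docs = sample_lda_corpus(n_docs=3, n_words=20, n_topics=4, vocab_size=30)
```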

^2 This derivation of SVI is an application of Efron’s plug-in principle CD applied to inference of the
population posterior. The plug-in principle says that we can replace the population $F$ with the empirical
distribution of the data $\hat{F}$ to make population inferences. In our empirical study, however, we found that
population VB often outperforms SVI. Treating the data in a true stream, and setting the number of data points
different to the true number, can improve predictive accuracy.
Figure 1: Held out predictive log likelihood for LDA on large-scale streamed text corpora. Population-VB
outperforms existing methods for two out of the three settings. We use the best settings of $\alpha$.


The second model is a Dirichlet process (DP) mixture ifT^. Loosely, DP mixtures are mixture
models with a potentially infinite number of components; thus choosing the number of components
is part of the posterior inference problem. When using variational inference for DP mixtures [4], we take
advantage of the stick breaking representation. The variables are mixture proportions $\pi \sim \mathrm{Stick}(\eta)$,
mixture components $\beta_k \sim H(\gamma)$ (for infinite $k$), mixture assignments $z_i \sim \mathrm{Mult}(\pi)$, and observations
$x_i \sim p(x_i \mid \beta, z_i)$. The joint is

\[ p(\beta, \pi, z, x \mid \eta, \gamma) = p(\pi \mid \eta)\, p(\beta \mid \gamma) \prod_{i=1}^{\alpha} p(z_i \mid \pi)\, p(x_i \mid \beta, z_i). \tag{10} \]

The likelihood and prior on the components are general to the observations at hand. In our study 
of continuous data we use normal priors and normal likelihoods; in our study of text data we use 
Dirichlet priors and multinomial likelihoods. 
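
For readers unfamiliar with the stick breaking representation, here is a minimal sketch of drawing truncated mixture proportions. It assumes the standard construction with Beta(1, η) stick fractions and a finite truncation level; the function name `truncated_stick_breaking` is hypothetical.

```python
import numpy as np

def truncated_stick_breaking(eta, truncation, seed=0):
    """Draw mixture proportions from a stick-breaking prior truncated at `truncation`."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, eta, size=truncation)   # fraction broken off the remaining stick
    v[-1] = 1.0                               # truncation: last component takes the remainder
    # Length of stick left before each break: 1, (1 - v_1), (1 - v_1)(1 - v_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

pi = truncated_stick_breaking(eta=1.0, truncation=100)
```

Setting the last fraction to one is what makes the truncated proportions sum to one.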

For both models, we use $\alpha$, which corresponds to the number of data points in traditional analysis.

Datasets. With LDA we analyze three large-scale streamed corpora: 1.7M articles from the New 
York Times spanning 10 years, 130K Science articles written over 100 years, and 7.4M tweets 
collected from Twitter on Feb 2nd, 2014. We processed them all in a similar way, choosing a 
vocabulary based on the most frequent words in the corpus (with stop words removed): 8,000 for the 
New York Times, 5,855 for Science, and 13,996 for Twitter. On Twitter, each tweet is a document,
and we removed duplicate tweets and tweets that did not contain at least 2 words in the vocabulary. 

With DP mixtures, we analyze human location behavior data. These data allow us to build periodic
models^3 of human population mobility, with applications to disaster response and urban planning.
The Ivory Coast location data contains 18M discrete cell tower locations for 500K users recorded
over 6 months [6]. The Microsoft Geolife dataset contains 35K latitude-longitude GPS locations for
182 users over 5 years. For both data sets, our observations reflect down-sampling the data to ensure
that each individual is seen no more than once every 15 minutes.

^3 A simple account of periodicity is captured in the mixture model by observing the time of the week as one
of the observation dimensions.
Figure 2: Held out predictive log likelihood for Dirichlet process mixture models on large-scale
streamed location and text data sets. Note that we apply Gaussian likelihoods in the Geolife dataset,
so the reported predictive performance is measured by probability density. We chose the best $\alpha$ for
each population-VB curve.


Results. We compare population VB with SVI [13] and streaming variational Bayes (SVB) [8] for
LDA Is) and DP mixtures [23]. SVB updates the variational approximation of the global parameter
using sequential Bayesian updating, essentially accumulating expected sufficient statistics from
minibatches of data observed in a stream. (Here we give the final results. We include the details of
how we set and fit various hyperparameters below.)

We measure model fitness by evaluating the average predictive log likelihood on a set of held-out 
data. This involves splitting the observations of held-out data points into two equal halves, inferring 
the local component distribution based on the first half, and testing with the second half ifB) . For 
DP-mixtures, this works by predicting the location of a held-out data point by conditioning on the 
observed time of week. 
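
The split-half evaluation can be sketched for a topic model with known topics. This is a simplified stand-in, not the procedure used in the study: it fits a point estimate of the document's topic proportions with a few smoothed EM steps instead of a variational distribution, and the name `heldout_log_likelihood` and its parameters are hypothetical.

```python
import numpy as np

def heldout_log_likelihood(doc, topics, gamma=0.1, n_steps=50):
    """Average log likelihood of the second half of `doc` given its first half."""
    doc = np.asarray(doc)
    half = len(doc) // 2
    observed, test = doc[:half], doc[half:]
    n_topics = topics.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)
    for _ in range(n_steps):
        # Responsibility of each topic for each observed word (columns sum to one).
        resp = theta[:, None] * topics[:, observed]
        resp /= resp.sum(axis=0, keepdims=True)
        # Smoothed re-estimate of the document's topic proportions.
        theta = resp.sum(axis=1) + gamma
        theta /= theta.sum()
    # Predictive probability of each held-out word under the inferred proportions.
    return float(np.mean(np.log(theta @ topics[:, test])))

topics = np.array([[0.45, 0.45, 0.05, 0.05],
                   [0.05, 0.05, 0.45, 0.45]])
score = heldout_log_likelihood([0, 1, 0, 1, 0, 1, 0, 1], topics)
```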

In standard offline studies, the held-out set is randomly selected from the data. With streams, however, 
we test on the next 10K documents (for New York Times, Science), 500K tweets (for Twitter), or 25K
locations (on Geo data). This is a valid held-out set because the data ahead of the current position in 
the stream have not yet been seen by the inference algorithms. 

Figure 1 shows the performance of our algorithms for LDA. We looked at two types of streams:
one in which the data appear in order and the other in which they have been permuted (i.e., an 
exchangeable stream). The time permuted stream reveals performance when each data minibatch is 
safely assumed to be an i.i.d. sample from F; this results in smoother improvements to predictive 
likelihood. On our data, we found that population VB outperformed SVI and SVB on two of the data 
sets and outperformed SVI on all of the data. SVB performed better than population VB on Twitter. 

Figure 2 shows a similar study for DP mixtures. We analyzed the human mobility data and the
New York Times. (Ref. [23] also analyzed the New York Times.) On these data population VB
outperformed SVB and SVI in all settings.^4

^4 Though our purpose is to compare algorithms, we make one note about a specific data set. The predictive
accuracy for the Ivory Coast data set plummets after 14M data points. This is because of the data collection
policy. For privacy reasons the data set provides the cell tower locations of a randomly selected cohort of 50K
users every 2 weeks [6]. The new cohort at 14M data points behaves differently to previous cohorts in a way that
affects predictive performance. However, both algorithms steadily improve after this shock.

Figure 3: We show the sensitivity of population-VB to hyperparameter $\alpha$ (based on final log
likelihoods in the time-ordered stream) and find that the best setting of $\alpha$ often differs from the true
number of data points (which may not be known in any case in practice).


Hyperparameters. Our methods are based on stochastic optimization and require setting the
learning rate $\rho^{(t)}$. For all gradient-based procedures, we used a small fixed learning rate to follow
noisy gradients. We note that adaptive learning rates cniEa are also applicable in this setting,
though we did not observe an improvement using these for time-ordered streams.

Our procedures also require setting a batch size, that is, how many data points we observe before updating
the approximate posterior. In the LDA study we set the batch size to 100 documents for the larger
corpora (New York Times, Science) and 5,000 for Twitter. These sizes were selected to make the
average number of words per batch equal in both settings, which helps lower the variance of the
gradients. In the DP mixture study we use a batch size of 5,000 locations for Ivory Coast, 500
locations for Geolife, and 100 documents for New York Times.

Unlike traditional Bayesian methods, the data set size $\alpha$ is a hyperparameter to population VB. It
helps control the posterior variance of the population posterior. Figure 3 reports sensitivity to $\alpha$ for
all studies (for the time-ordered stream). These plots indicate that the optimal setting of $\alpha$ is often
different from the true number of data points; the best performing population posterior variance is
not necessarily the one implied by the data.

LDA requires additional hyperparameters. In line with IITtII, we used 100 topics, and set the
hyperparameter to the global topics (which controls the word sparsity of topics) to $\eta = 0.01$ and the
hyperparameter to the word-topic assignments (which controls the sparsity of topic membership for
each word) to $\gamma = 0.1$. (We use these hyperparameters in Eq. 9 and Eq. 10.) The DP mixture model
requires a truncation hyperparameter $K$, which we set to 100 for all three data sets and verified that
the number of components used after inference was less than this limit.
4 Conclusions and Future Work


We introduced a new approach to modeling through the population posterior, a distribution over latent 
variables that combines traditional Bayesian inference with the frequentist idea of the population
distribution. With this idea, we derived population variational Bayes, an efficient algorithm for 
inference on streams. On two complex Bayesian models and several large data sets, we found that 
population variational Bayes usually performs better than existing approaches to streaming inference. 
A Derivation and Bounds of the F-ELBO

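Classic variational inference seeks to minimize $\mathrm{KL}(q(\beta, z) \,\|\, p(\beta, z \mid x))$ using the following
equivalence to show that the negative evidence lower bound (ELBO) is an appropriate surrogate objective to
be minimized,

\[ \log p(x) = \mathrm{KL}(q(\beta, z) \,\|\, p(\beta, z \mid x)) + \mathbb{E}_{q}[\log p(\beta, z, x) - \log q(\beta, z)]. \tag{11} \]

This equivalence arises from the definition of KL divergence EtII.

To derive the F-ELBO, replace $x$ with a draw $X$ of size $\alpha$ from the population distribution, $X \sim F_\alpha$,
then apply an expectation with respect to $F_\alpha$ to both sides of Eq. 11,

\begin{align*}
\mathbb{E}_{F_\alpha}[\log p(X)] &= \mathbb{E}_{F_\alpha}\big[ \mathrm{KL}(q(\beta, z) \,\|\, p(\beta, z \mid X)) + \mathbb{E}_{q}[\log p(\beta, z, X) - \log q(\beta, z)] \big] \\
&= \mathbb{E}_{F_\alpha}[\mathrm{KL}(q(\beta, z) \,\|\, p(\beta, z \mid X))] + \mathbb{E}_{F_\alpha}\big[ \mathbb{E}_{q}[\log p(\beta, z, X) - \log q(\beta, z)] \big]. \tag{12}
\end{align*}

This confirms that the negative F-ELBO is a surrogate objective for $\mathbb{E}_{F_\alpha}[\mathrm{KL}(q(\beta, z) \,\|\, p(\beta, z \mid X))]$
because $q(\cdot)$ does not appear on the left hand side of Eq. 12.

Now use the fact that logarithm is a concave function and apply Jensen’s inequality to Eq. 12 to show
that the F-ELBO is a lower bound on the population evidence,

\begin{align*}
\mathbb{E}_{F_\alpha}\big[ \mathbb{E}_{q}[\log p(\beta, z, X) - \log q(\beta, z)] \big] &\le \mathbb{E}_{F_\alpha}[\log p(X)] \\
&\le \log \mathbb{E}_{F_\alpha}[p(X)]. \tag{13}
\end{align*}

Additionally, Jensen’s inequality applied to Eq. 12 in a different way shows that maximizing the
F-ELBO minimizes an upper bound on the KL divergence between $q(\cdot)$ and the population posterior,

\begin{align*}
\mathbb{E}_{F_\alpha}[\mathrm{KL}(q(\beta, z) \,\|\, p(\beta, z \mid X))] &= \mathbb{E}_{q}[\log q(\beta, z)] - \mathbb{E}_{F_\alpha}\big[ \mathbb{E}_{q}[\log p(\beta, z \mid X)] \big] \\
&\ge \mathbb{E}_{q}[\log q(\beta, z)] - \mathbb{E}_{q}\big[ \log \mathbb{E}_{F_\alpha}[p(\beta, z \mid X)] \big] \\
&= \mathrm{KL}\big( q(\beta, z) \,\|\, \mathbb{E}_{F_\alpha}[p(\beta, z \mid X)] \big), \tag{14}
\end{align*}

where we have exchanged expectations with respect to $q(\cdot)$ and $F_\alpha$.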
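
The chain of bounds in Eqs. 12 to 14 can be checked numerically on a model small enough to enumerate. The sketch below is an illustration under assumed settings that are not in the paper: a single binary global variable, Bernoulli observations, no local variables, and a Bernoulli(0.5) population. The name `check_f_elbo_bounds` and all default values are hypothetical.

```python
import math

def check_f_elbo_bounds(alpha=5, r=0.6, like=(0.2, 0.7), f_prob=0.5):
    """Exactly evaluate the terms of Eqs. 12-14 for a two-state toy model.

    beta is binary with a uniform prior, x_i | beta ~ Bernoulli(like[beta]),
    q(beta = 1) = r, and the population F is Bernoulli(f_prob).
    """
    prior = (0.5, 0.5)
    q = (1.0 - r, r)
    f_elbo = e_log_ev = e_ev = e_kl = 0.0
    pop_post = [0.0, 0.0]
    for s in range(alpha + 1):
        # Total population probability of all data sets containing s ones.
        w = math.comb(alpha, s) * f_prob**s * (1 - f_prob)**(alpha - s)
        joint = [prior[b] * like[b]**s * (1 - like[b])**(alpha - s) for b in (0, 1)]
        evidence = sum(joint)
        post = [j / evidence for j in joint]
        elbo = sum(q[b] * (math.log(joint[b]) - math.log(q[b])) for b in (0, 1))
        kl = sum(q[b] * math.log(q[b] / post[b]) for b in (0, 1))
        f_elbo += w * elbo
        e_log_ev += w * math.log(evidence)
        e_ev += w * evidence
        e_kl += w * kl
        pop_post = [pop_post[b] + w * post[b] for b in (0, 1)]
    # KL from q to the population posterior, the right-hand side of Eq. 14.
    kl_pop = sum(q[b] * math.log(q[b] / pop_post[b]) for b in (0, 1))
    return f_elbo, e_log_ev, math.log(e_ev), e_kl, kl_pop

f_elbo, e_log_ev, log_e_ev, e_kl, kl_pop = check_f_elbo_bounds()
```

The returned values are, in order, the F-ELBO, the expected log evidence, the log population evidence, the expected KL, and the KL to the population posterior.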


B One-Parameter F-ELBO 


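The F-ELBO for conditionally conjugate exponential families is as follows,

\[ \mathcal{L}(\lambda, \phi; F_\alpha) = \mathbb{E}_{F_\alpha}\Big[ \mathbb{E}_{q}\Big[ \log p(\beta) - \log q(\beta \mid \lambda) + \sum_{i=1}^{\alpha} \big( \log p(X_i, Z_i \mid \beta) - \log q(Z_i) \big) \Big] \Big]. \]

This can be rewritten in terms of just the global variational parameters. We define the one parameter
population variational inference objective as $\mathcal{L}(\lambda; F_\alpha) = \max_{\phi} \mathcal{L}(\lambda, \phi; F_\alpha)$. We can write this more
compactly if we let $\phi_i(\lambda)$ be the value of $\phi_i$ that maximizes the F-ELBO given $\lambda$.^5 Formally, this
gives

\[ \mathcal{L}(\lambda; F_\alpha) = \mathbb{E}_{q}\Big[ \log p(\beta) - \log q(\beta \mid \lambda) + \mathbb{E}_{X \sim F_\alpha}\Big[ \sum_{i=1}^{\alpha} \mathbb{E}_{q(Z_i \mid \phi_i(\lambda))}\big[ \log p(X_i, Z_i \mid \beta) - \log q(Z_i) \big] \Big] \Big], \]

where we have moved the expectation with respect to $F_\alpha$ inside the expectation with respect to $q(\cdot)$.


^5 The optimal local variational parameter $\phi_i$ can be computed using gradient ascent or coordinate ascent as
done in [13].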